Pictured above is a car crash on the Bay Bridge, in Maryland.
In order to demonstrate the "Data Science Pipeline", I am using the State of Maryland's vehicular crash data. It should come as no surprise that driving is one of the most dangerous forms of transportation. It's estimated that 2 out of 3 drivers will be involved in an injury accident (a vehicular accident that causes physical injury) in their lifetime. Aggressive driving and speeding are to blame for most car accidents. Understanding and promoting awareness of the real dangers of driving is incredibly important. Driving is something many of us take for granted, but it can become very dangerous very quickly.
sources: https://www.drive-safely.net/driving-statistics/ & https://www.nbcnews.com/news/us-news/heroic-car-crash-witness-saves-toddler-who-was-ejected-maryland-n1266212
After looking through many data sets on https://opendata.maryland.gov/, I eventually settled on Maryland Statewide Vehicle Crashes. It provided data from January 2015 through December 2020, covering every county in Maryland. Not only is it relatively recent, it also gave me a lot to work with. (This website has data sets for just about everything related to Maryland, though!) The data set itself is at https://opendata.maryland.gov/Public-Safety/Maryland-Statewide-Vehicle-Crashes/65du-s3qu, which has a handy link to download the CSV file and a built-in visualization feature.
!pip install folium
import folium
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
Collecting folium
Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
Collecting branca>=0.3.0
Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1
#Read comma-separated values into a dataframe (called t).
t = pd.read_csv("https://opendata.maryland.gov/api/views/65du-s3qu/rows.csv?accessType=DOWNLOAD")
t.head()
/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3146: DtypeWarning: Columns (34,46) have mixed types. Specify dtype option on import or set low_memory=False.
| YEAR | QUARTER | LIGHT_DESC | LIGHT_CODE | COUNTY_DESC | COUNTY_NO | MUNI_DESC | MUNI_CODE | JUNCTION_DESC | JUNCTION_CODE | ... | FEET_MILES_FLAG_DESC | FEET_MILES_FLAG | DISTANCE_DIR_FLAG | REFERENCE_NO | REFERENCE_TYPE_CODE | REFERENCE_SUFFIX | REFERENCE_ROAD_NAME | LATITUDE | LONGITUDE | LOCATION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | Q2 | Daylight | 1.00 | Baltimore | 3.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 39.277263 | -76.503693 | POINT (-76.5036932 39.27726285) |
| 1 | 2020 | Q2 | NaN | 6.02 | Baltimore City | 24.0 | NaN | NaN | Non Intersection | 1.0 | ... | Miles | M | N | NaN | NaN | NaN | NORTH AVE | 39.311025 | -76.616429 | POINT (-76.616429453205 39.311024794431) |
| 2 | 2020 | Q2 | Daylight | 1.00 | Montgomery | 15.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 39.140680 | -77.193413 | POINT (-77.193412729561 39.140680249069) |
| 3 | 2017 | Q2 | Daylight | 1.00 | Baltimore City | 24.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 39.282928 | -76.635215 | POINT (-76.6352150952347 39.2829284750108) |
| 4 | 2020 | Q2 | Daylight | 1.00 | Cecil | 7.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 39.611028 | -75.951314 | POINT (-75.951314 39.611027833333) |
5 rows × 56 columns
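As an aside, the DtypeWarning above can be silenced at read time by telling pandas the dtype of the offending columns. A minimal sketch, using a tiny in-memory CSV rather than the real download (the sample values here are made up; REFERENCE_NO is one of the dataset's columns that mixes numbers and strings):

```python
import io

import pandas as pd

#A tiny CSV with a mixed-type column (numbers and strings), which
#would trigger a DtypeWarning on a large file read in chunks.
csv = io.StringIO("YEAR,REFERENCE_NO\n2020,123\n2019,A-45\n")

#Forcing the offending column to str avoids the mixed-type inference.
df = pd.read_csv(csv, dtype={"REFERENCE_NO": str})
print(df["REFERENCE_NO"].tolist())
```

Passing `low_memory=False` instead would also work, at the cost of reading the whole file into memory before inferring types.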
Not only is data (sometimes) read in messily, but I certainly did not need all 56 columns that my data set provided. I was able to get rid of most of them, including columns 34 and 46 (which triggered the warning above). Most of the dropped columns contained numerical codes for the descriptions, and since the key was not provided, the numbers were virtually useless. I ended up keeping:
#Data tidying: keep only the useful columns (dropping, among others, cols 34 and 46 that gave the warning).
t = t.filter(['YEAR', 'QUARTER', 'LIGHT_DESC', 'COUNTY_DESC', 'JUNCTION_DESC', 'SURF_COND_DESC', 'REPORT_TYPE', 'ACC_DATE', 'ACC_TIME', 'LATITUDE', 'LONGITUDE'])
t.head()
| YEAR | QUARTER | LIGHT_DESC | COUNTY_DESC | JUNCTION_DESC | SURF_COND_DESC | REPORT_TYPE | ACC_DATE | ACC_TIME | LATITUDE | LONGITUDE | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | Q2 | Daylight | Baltimore | NaN | NaN | Property Damage Crash | 20200618 | 15:15:00 | 39.277263 | -76.503693 |
| 1 | 2020 | Q2 | NaN | Baltimore City | Non Intersection | Dry | Injury Crash | 20200430 | 06:39:00 | 39.311025 | -76.616429 |
| 2 | 2020 | Q2 | Daylight | Montgomery | NaN | NaN | Injury Crash | 20200504 | 09:46:00 | 39.140680 | -77.193413 |
| 3 | 2017 | Q2 | Daylight | Baltimore City | NaN | NaN | Injury Crash | 20170507 | 10:39:00 | 39.282928 | -76.635215 |
| 4 | 2020 | Q2 | Daylight | Cecil | NaN | NaN | Property Damage Crash | 20200414 | 17:32:00 | 39.611028 | -75.951314 |
The column names were changed to be easier to type and to take up less space. They have the same meanings as before; they are just a little more concise now.
#columns renamed
t.columns = ['Year', 'Qtr', 'L/D', 'County', 'Junc', 'Surface', 'Severity', 'Date', 'Time', 'Lat', 'Long']
t.head()
| Year | Qtr | L/D | County | Junc | Surface | Severity | Date | Time | Lat | Long | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | Q2 | Daylight | Baltimore | NaN | NaN | Property Damage Crash | 20200618 | 15:15:00 | 39.277263 | -76.503693 |
| 1 | 2020 | Q2 | NaN | Baltimore City | Non Intersection | Dry | Injury Crash | 20200430 | 06:39:00 | 39.311025 | -76.616429 |
| 2 | 2020 | Q2 | Daylight | Montgomery | NaN | NaN | Injury Crash | 20200504 | 09:46:00 | 39.140680 | -77.193413 |
| 3 | 2017 | Q2 | Daylight | Baltimore City | NaN | NaN | Injury Crash | 20170507 | 10:39:00 | 39.282928 | -76.635215 |
| 4 | 2020 | Q2 | Daylight | Cecil | NaN | NaN | Property Damage Crash | 20200414 | 17:32:00 | 39.611028 | -75.951314 |
After renaming columns, some new columns had to be added (for convenience later). I added a DateTime column, parsed from the Date column of the dataframe. I also added a DaysSinceStart counter, so that I'm not solely relying on the datetime object and have a little more to play around with later on. Next, I added a Y/NSevere column, which holds a 1 if the crash was classified as an Injury Crash or Fatal Crash, and a 0 otherwise. And, lastly, I added a %Light column to give this categorical variable a numerical representation: Daylight is 1.0, Dawn/Dusk is .5, Dark Lights On is .25, and everything else is 0 (as it occurred in darkness).
I also decided to drop any rows that were missing data. The data set has about 666,000 rows, so losing a few would not meaningfully impact the results. And since there is no meaningful average for the categorical variables, excluding incomplete rows made more sense than imputing them.
#DateTime column created
t['DateTime'] = pd.to_datetime(t['Date'], format="%Y%m%d")
#DaysSinceStart column created
start = pd.to_datetime("20150101", format="%Y%m%d")
t['DaysSinceStart'] = t['DateTime'] - start
#Y/NSevere column created
t['Y/NSevere']=t['Severity'].apply(lambda x: 1 if x == "Injury Crash" or x == "Fatal Crash" else 0)
#%Light column created
t['%Light']=t['L/D'].apply(lambda x: 1.0 if x == "Daylight" else .5 if x == "Dawn" or x == "Dusk" else
.25 if x == "Dark Lights On" else 0)
#Missing data is excluded
t = t.dropna()
t.head()
| Year | Qtr | L/D | County | Junc | Surface | Severity | Date | Time | Lat | Long | DateTime | DaysSinceStart | Y/NSevere | %Light | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 2020 | Q2 | Daylight | Baltimore City | Intersection | Wet | Property Damage Crash | 20200426 | 15:14:00 | 39.312903 | -76.651472 | 2020-04-26 | 1942 days | 0 | 1.00 |
| 18 | 2020 | Q2 | Daylight | Kent | Non Intersection | Dry | Property Damage Crash | 20200510 | 19:16:00 | 39.349181 | -75.878126 | 2020-05-10 | 1956 days | 0 | 1.00 |
| 21 | 2020 | Q2 | Dark Lights On | Queen Anne's | Non Intersection | Dry | Injury Crash | 20200513 | 01:24:00 | 39.204898 | -76.052318 | 2020-05-13 | 1959 days | 1 | 0.25 |
| 53 | 2020 | Q2 | Dark No Lights | Carroll | Non Intersection | Dry | Property Damage Crash | 20200618 | 23:31:00 | 39.478947 | -77.068709 | 2020-06-18 | 1995 days | 0 | 0.00 |
| 56 | 2020 | Q2 | Daylight | Howard | Non Intersection | Dry | Property Damage Crash | 20200605 | 16:27:00 | 39.222951 | -76.808682 | 2020-06-05 | 1982 days | 0 | 1.00 |
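Before trusting a blanket dropna(), it's worth checking how much data it actually discards. A sketch on a small stand-in frame (in practice the real t would be inspected the same way):

```python
import pandas as pd

#Stand-in for the crash dataframe, with some missing values.
demo = pd.DataFrame({
    "County": ["Baltimore", None, "Cecil", "Kent"],
    "Surface": ["Dry", "Wet", None, "Dry"],
})

#Missing values per column, and the share of rows dropna() would remove.
print(demo.isna().sum())
frac_lost = 1 - len(demo.dropna()) / len(demo)
print(f"{frac_lost:.0%} of rows would be dropped")
```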
In order to try to identify trends or patterns, the data will be visualized in maps and graphs.
#yearly variable created by grouping the original dataframe by the years. it was then used to make the line plot.
yearly = t.groupby('Year')
yearly = yearly.size()
yearly.plot(kind='line', title='Crashes by Year', xticks=[2016, 2017, 2018, 2019, 2020], xlabel='Year',
ylabel='Number of Crashes', figsize=(10, 8))
<AxesSubplot:title={'center':'Crashes by Year'}, xlabel='Year', ylabel='Number of Crashes'>
To get started, I decided to look at the yearly trends. After grouping the data by year and plotting it, it is pretty clear that something happened in 2018. After reaching its peak that year, the number of crashes dropped drastically, by about 25,000, through 2020. I felt the need to break this down further, to see what's really going on.
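The size of that drop can be read off directly with a year-over-year difference on the grouped counts. A sketch, with made-up counts standing in for the real yearly series:

```python
import pandas as pd

#Made-up yearly crash counts, shaped like t.groupby('Year').size().
yearly = pd.Series({2018: 110000, 2019: 100000, 2020: 85000})

#Year-over-year change; the 2020 entry shows the size of the drop.
yoy = yearly.diff()
print(yoy)
```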
#qtrly variable created by grouping the original dataframe by the quarters. it was then used to make the line plot.
qtrly = t.groupby('Qtr')
qtrly = qtrly.size()
qtrly.plot(kind='line', title='Crashes by Quarter', xlabel='Quarter',
ylabel='Number of Crashes', figsize=(10, 8))
<AxesSubplot:title={'center':'Crashes by Quarter'}, xlabel='Quarter', ylabel='Number of Crashes'>
After breaking things down into quarters, the trends continue. The majority of crashes occurred in the fourth quarter (the winter months). Perhaps due to inclement weather? Perhaps due to the many holidays? A further breakdown is needed to understand why.
#timely variable created by grouping the original dataframe by their times (of day). it was then used to make the
#line plot.
timely = t.groupby('Time')
timely = timely.size()
timely.plot(kind='line', title='Crashes by Time of Day', xlabel='Time', ylabel='Number of Crashes', figsize=(10, 8))
<AxesSubplot:title={'center':'Crashes by Time of Day'}, xlabel='Time', ylabel='Number of Crashes'>
Not too surprisingly, the majority of crashes appear to occur in the middle of the day, when the most people are on the roads. There is probably a lot more aggressive driving and speeding when more people are out, and, as mentioned in the introduction, those are the top two causes of crashes. I'd like to have a look at some other variables, though.
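One caveat: grouping on the raw Time strings produces one point per distinct minute, which makes for a noisy line. Binning by hour gives a smoother curve. A sketch, assuming Time strings look like those in the tables above:

```python
import pandas as pd

#A few HH:MM:SS crash times standing in for t['Time'].
times = pd.Series(["15:15:00", "06:39:00", "15:50:00", "09:46:00"])

#Extract the hour and count crashes per hour of the day.
hours = pd.to_datetime(times, format="%H:%M:%S").dt.hour
by_hour = hours.value_counts().sort_index()
print(by_hour)
```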
#ld variable created by grouping the original dataframe by their light classification. it was then used to make the
#pie graph.
ld = t.groupby('L/D')
ld = ld.size()
ld.plot(kind='pie', title='Crashes by Amount of Daylight', xlabel='Daylight', ylabel='Daylight', figsize=(10, 8))
<AxesSubplot:title={'center':'Crashes by Amount of Daylight'}, ylabel='Daylight'>
This is where things start to get a little confusing. I had originally thought that most car accidents would occur at night (or at least at dusk/dawn). Instead, the majority of crashes appear to occur in daylight, and the last line graph (of crash times) agrees with this pie chart.
#cond variable created by grouping the original dataframe by the surface conditions of the road. it was then used
#to make the pie graph.
cond = t.groupby('Surface')
cond = cond.size()
cond.plot(kind='pie', title='Crashes based on Road Conditions', xlabel='Surface Conditions', ylabel='Surface Conditions',
figsize=(10, 8))
<AxesSubplot:title={'center':'Crashes based on Road Conditions'}, ylabel='Surface Conditions'>
Even more surprising to me than the amount of light were the road conditions. For the majority of crashes, it appears that the roads were dry. I would have thought that wet conditions would lead to more crashes (and that's what they emphasized in Driver's Ed, too!).
#sev variable created by grouping the original dataframe by the severity of the aftermath. it was then used to make
#the pie graph.
sev = t.groupby('Severity')
sev = sev.size()
sev.plot(kind='pie', title='Crashes based on Severity', ylabel='Severity', figsize=(10, 8))
<AxesSubplot:title={'center':'Crashes based on Severity'}, ylabel='Severity'>
Luckily, only a small share of crashes from 2015 to 2020 were fatal! The majority appear to have been property damage crashes, which is understandable.
#because there are too many entries to plot on a map, I am randomly choosing 500 to geographically analyze.
sampledata = t.sample(n=500)
#the starting map is (roughly) centered about Maryland
map1 = folium.Map(location=[38.52, -76.61], zoom_start=7)
#Each row in the sample data is colored by county: the five selected
#jurisdictions get their own colors, and everything else is pink.
county_colors = {
    "Montgomery": "lightblue",        #Montgomery County is light blue
    "Prince George's": "lightgreen",  #Prince George's County is light green
    "Baltimore": "purple",            #Baltimore County is purple
    "Baltimore City": "orange",       #Baltimore City is orange
    "Anne Arundel": "cadetblue",      #Anne Arundel County is cadet blue
}
for index, row in sampledata.iterrows():
    color = county_colors.get(row["County"], "pink")
    folium.Marker(location=[row["Lat"], row["Long"]],
                  icon=folium.Icon(color=color, icon="asterisk")).add_to(map1)
map1
#colored the most populated counties, and that's where the most accidents are...
When plotting on a map, it is important to remember not to over-generalize. With too few colors, or too few distinguishing icons, it can be difficult to visualize what's going on (which defeats the whole purpose). That said, the opposite holds too: too many can be just as counterproductive. I thought that 23 different colors (one for each of the 23 jurisdictions) would be too much, so I decided to color-code the five most populated ones in Maryland: in order, Montgomery County, Prince George's County, Baltimore County, Baltimore City, and Anne Arundel County. Technically speaking, Baltimore City was not part of the top five, but I had differentiated all of its surrounding counties and did not want it to get lost, so I gave it its own color along with the other populated areas. The majority of crashes appears to occur in these heavily populated areas.
#Scatter plot of %Light against the date, to eyeball any trend over time.
plt.scatter(marker='.', x=t['DateTime'], y=t['%Light'])

#Fit a linear regression of crash severity (Y/NSevere) on time.
#DaysSinceStart (as an integer day count) is used instead of the raw
#YYYYMMDD Date integer, since the latter is not evenly spaced in time.
X = t['DaysSinceStart'].dt.days.values.reshape(-1, 1)
y = np.array(t['Y/NSevere'].values).reshape(-1, 1)

linreg = linear_model.LinearRegression()
linreg.fit(X, y)
ce = linreg.coef_
itcpt = linreg.intercept_
plt.plot(X, (ce * X) + itcpt)
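Since train_test_split was imported at the top, a held-out evaluation of a regression like this could look like the sketch below. It uses synthetic data (a made-up noisy linear trend), not the crash dataframe, so the numbers are purely illustrative:

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

#Synthetic data: a noisy linear trend standing in for a response over time.
rng = np.random.default_rng(0)
X = np.arange(200, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=2.0, size=200)

#Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
linreg = linear_model.LinearRegression()
linreg.fit(X_train, y_train)

#R^2 on data the model never saw during fitting.
print(f"R^2 on held-out data: {linreg.score(X_test, y_test):.2f}")
```

Evaluating on held-out data guards against reporting an optimistically biased fit, which a single fit-and-plot on all the data cannot detect.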
While there is never one simple answer to issues like this, we can certainly see some related factors. The timing and location of crashes in Maryland are pretty understandable given external factors: when there are more people out on the roads, a car accident is more likely. Even an incredibly safe driver cannot always trust others. This also shows in the daylight/darkness chart, since the majority of crashes occur when it is light out. And it makes sense that there are more crashes in heavily populated areas; Baltimore County and Baltimore City have a lot of traffic crammed into a (relatively) small area. I was particularly shocked by the road conditions. I really thought that the majority of accidents would happen when the roads are wet and slick, but I stand corrected. This reinforces the importance of reducing distracted driving: if external factors are not causing people to crash, then it is likely the driver. It also makes perfect sense that the majority of crashes involve only property damage; even if nobody is hurt, the car(s) involved will likely sustain some sort of damage.
It would be interesting to see whether these trends continue into 2021 and beyond. Looking at the yearly line graph, it appears that crashes may continue to decrease. A culture change has certainly helped with this, too: it is no longer "cool" to text and drive, and many people have begun to call out friends and family when they see it. It will be interesting to see whether that plays any role in a (possible) reduction in crashes next year.
If you would like more information on car crashes in Maryland, or information on the dangers of driving, these resources may be a good starting point.